Regression
Fitting data.
Formula
$$ \begin{aligned} \underline{w} &= \arg\min_{\underline{w}} \mathcal{L}(\underline{w}) \\ &= \arg\min_{\underline{w}} \sum_{i=1}^{N} \ell(\underline{w}, \underline{x}^{(i)}; t^{(i)}) \\ &= \arg\min_{\underline{w}} \sum_{i=1}^{N} \left[ \Phi(\underline{x}^{(i)}) \cdot \underline{w} - t^{(i)} \right]^2 \end{aligned} $$
One way to deal with functions that aren't linear in the original inputs is to augment the feature space (by $\Phi$) so that the target function becomes linear in the augmented space.
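A minimal sketch of this least-squares fit, assuming a quadratic feature map $\Phi(x) = [1, x, x^2]$ and synthetic data (all names and values below are illustrative, not from the notes):

```python
import numpy as np

# Quadratic feature map Phi(x) = [1, x, x^2]; an assumed example, not a fixed choice.
def phi(x):
    return np.stack([np.ones_like(x), x, x**2], axis=1)  # N x 3 design matrix

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.1, size=50)  # synthetic targets

Phi = phi(x)
# Minimize sum_i [Phi(x_i) . w - t_i]^2 via numpy's least-squares solver.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w)  # close to the generating weights [1, 2, -3]
```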
Problem: Overfitting
With too many features and not enough data, the model overfits.
Solution: Cross Validation and Regularization.
Cross Validation
Try different feature-set sizes and determine where overfitting starts. This requires a lot of data and is slow: if data is scarce, each training split may end up with fewer examples than features; if data is abundant, each round takes a long time. It is also impractical when particular features are of interest in their own right.
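A sketch of this procedure, assuming k-fold cross validation over polynomial degree as a stand-in for feature-set size (the data and helper names are illustrative):

```python
import numpy as np

def poly_features(x, degree):
    return np.vander(x, degree + 1, increasing=True)  # [1, x, ..., x^degree]

def cv_error(x, t, degree, k=5):
    """Average validation error over k folds for a given feature size (degree)."""
    idx = np.arange(len(x))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        Phi_tr = poly_features(x[train], degree)
        Phi_va = poly_features(x[fold], degree)
        w, *_ = np.linalg.lstsq(Phi_tr, t[train], rcond=None)
        errors.append(np.mean((Phi_va @ w - t[fold]) ** 2))
    return np.mean(errors)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
t = np.sin(3 * x) + rng.normal(scale=0.1, size=100)
for degree in range(1, 10):
    print(degree, cv_error(x, t, degree))  # validation error starts rising once we overfit
```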
Regularization
Penalize complicated answers: trade training performance against solution complexity. $$ \begin{aligned} \underline{w} &= \arg\min_{\underline{w}} \left[ \lambda \Vert\underline{w}\Vert_2 + \mathcal{L}(\underline{w}) \right] \\ &= \arg\min_{\underline{w}} \left( \lambda \Vert\underline{w}\Vert_2 + \sum_{i=1}^{N} \left[ \Phi(\underline{x}^{(i)}) \cdot \underline{w} - t^{(i)} \right]^2 \right) \end{aligned} $$
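A minimal sketch of the L2-regularized (ridge) solution, assuming the common squared-norm penalty $\lambda \Vert\underline{w}\Vert_2^2$, which has the closed form $\underline{w} = (\Phi^\top\Phi + \lambda I)^{-1}\Phi^\top\underline{t}$ (data below are synthetic and illustrative):

```python
import numpy as np

def ridge(Phi, t, lam):
    """Closed-form ridge solution w = (Phi^T Phi + lam * I)^{-1} Phi^T t."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 30)
Phi = np.vander(x, 10, increasing=True)          # deliberately over-parameterized features
t = np.sin(3 * x) + rng.normal(scale=0.1, size=30)

for lam in (0.0, 0.1, 10.0):
    w = ridge(Phi, t, lam)
    print(lam, np.linalg.norm(w))                # larger lambda -> smaller weight norm
```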
Notation:
- Norm ball: $\Vert\underline{w}\Vert_{q}$ with $q \in \{0.5, 1, 2, 3\}$. Different $q$ values give differently shaped ball boundaries for $\underline{w}$; the boundary represents the limit on allowed combinations of the components of $\underline{w}$.
- $L_2$ regularization: $q=2$. Favors many small weights. Has a Bayesian counterpart (a Gaussian prior on the weights).
- $L_1$ regularization: $q=1$, known as LASSO. Tends to produce sparse weights (many components of $\underline{w}$ are exactly zero). Often used in compressed sensing, where a sparse combination of basis functions consistent with the observations is sought; goals in this field are often formulated as $L_1$ minimization problems. Has a Bayesian counterpart (a Laplace prior on the weights). See the sketch after this list.
- $L_0$ norm: the number of non-zero elements in a vector. Best for sparseness, but hard to optimize directly; the $L_1$ norm can be seen as a reasonable approximation.
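A sketch contrasting the two penalties on the same synthetic problem, using scikit-learn's Lasso and Ridge estimators (the problem setup below is an assumption for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.5, 0.5]                    # only 3 of 20 features actually matter
y = X @ true_w + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)               # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)               # L2 penalty
print(np.sum(lasso.coef_ != 0))                  # few non-zero weights (sparse)
print(np.sum(np.abs(ridge.coef_) > 1e-3))        # many small but non-zero weights
```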
Other ways to get sparseness: forward selection (start with a small feature set and gradually add features until cross-validation performance stops improving) and backward elimination (start with all features and gradually remove them). The issue is that both methods are greedy and can be slow.
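A sketch of greedy forward selection, assuming hold-out validation error as the stopping criterion (data and helper names are illustrative):

```python
import numpy as np

def val_error(X, y, Xv, yv, cols):
    """Fit least squares on the selected columns and return hold-out error."""
    w, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    return np.mean((Xv[:, cols] @ w - yv) ** 2)

def forward_selection(X, y, Xv, yv):
    selected, remaining, best = [], list(range(X.shape[1])), np.inf
    while remaining:
        # Greedily add the single feature that most improves validation error.
        err, j = min((val_error(X, y, Xv, yv, selected + [j]), j) for j in remaining)
        if err >= best:                          # stop when no feature helps any more
            break
        best = err
        selected.append(j)
        remaining.remove(j)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3 * X[:, 2] - 2 * X[:, 7] + rng.normal(scale=0.1, size=200)
print(forward_selection(X[:150], y[:150], X[150:], y[150:]))  # expected to pick features 2 and 7
```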
Problem: Bias-Variance Tradeoff
Models with low bias usually have high variance. For instance, a squiggly curve that fits the training data well (low bias) may fit very differently between the training set and the test set (high variance). The ideal model has both low bias and low variance.
Terms:
- Variance: The difference in fit between the training set and the test set (how much the fit changes across data sets).
- Bias: How poorly the model fits the training set (its inability to capture the true relationship).
Situations:
- High-bias, low-variance algorithms produce models that are consistent but inaccurate on average, e.g. linear regression, naïve Bayes.
- High-variance, low-bias algorithms produce models that are accurate on average but inconsistent, e.g. decision trees, nearest neighbors.
When data is scarce, variance can be a problem; restricting the hypothesis space can reduce variance at the cost of increased bias. When data is plentiful, variance is less of a concern.
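A sketch illustrating the tradeoff, assuming a synthetic data-generating process: polynomials of increasing degree are refit on many resampled training sets, and the spread of their test errors (a variance proxy) grows as the fit becomes more flexible.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Synthetic data: a sine curve plus noise (an assumed generating process)."""
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(scale=0.2, size=n)

x_test, t_test = sample(200)
for degree in (1, 4, 10):
    test_errors = []
    for _ in range(100):                         # 100 resampled training sets
        x_tr, t_tr = sample(50)
        w = np.polyfit(x_tr, t_tr, degree)
        test_errors.append(np.mean((np.polyval(w, x_test) - t_test) ** 2))
    # Low degree: stable but poor fit (high bias); high degree: flexible fit,
    # but test error varies a lot across training sets (high variance).
    print(degree, np.mean(test_errors), np.std(test_errors))
```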